Methodology
For each historical backtest date, we:
- Filter data to models released before that date
- Fit an ECI (Epoch Capabilities Index) model using IRT
- Extract trend data for top-N models by ECI at release
- Fit a trend using either BIC model selection or Last M Months linear regression
- Forecast ECI forward and convert it to benchmark scores via sigmoid(slope × (ECI - EDI)), where EDI and slope are per-benchmark parameters
- Compare against actual SOTA at the target date
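The conversion in the forecast step can be sketched as follows. This is a minimal illustration, not the project's code: the function name `eci_to_score` and the example parameter values are assumptions, and the score is taken to be a fraction in [0, 1].

```python
import math

def eci_to_score(eci: float, edi: float, slope: float) -> float:
    """Map an ECI value to a benchmark score via sigmoid(slope * (ECI - EDI)).

    `edi` and `slope` are the per-benchmark parameters; the values used
    below are made up for illustration.
    """
    z = slope * (eci - edi)
    # Numerically stable logistic: avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# A model whose ECI equals the benchmark's EDI is predicted to score 50%.
midpoint_score = eci_to_score(eci=120.0, edi=120.0, slope=0.1)
```

Because the sigmoid is monotonic, a higher forecast ECI always maps to a higher predicted score for a benchmark with positive slope.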
Trend Methods
- BIC: Compares a linear model against a piecewise-linear model (single breakpoint) using the Bayesian Information Criterion. If the piecewise model is preferred, the slope after the breakpoint is used for forecasting.
- Last M Months: Simple linear regression on only the last M months of data, where M equals the forecast horizon.
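The BIC comparison can be sketched in plain Python. Several details here are assumptions, since the text does not specify them: the piecewise model is fitted as two independent line segments (no continuity constraint), the breakpoint is found by trying every split that leaves at least `min_pts` points on each side, and the parameter counts are 2 for linear and 5 for piecewise (two intercepts, two slopes, one breakpoint). All function names are hypothetical.

```python
import math

def ols(t, y):
    """Ordinary least squares for y = a + b*t; returns (a, b, rss)."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b = sxy / sxx
    a = y_mean - b * t_mean
    rss = sum((yi - (a + b * ti)) ** 2 for ti, yi in zip(t, y))
    return a, b, rss

def bic(rss, n, k):
    """Gaussian BIC up to an additive constant; k is the parameter count."""
    # Floor the RSS so a perfect fit does not produce log(0).
    return n * math.log(max(rss, 1e-12) / n) + k * math.log(n)

def select_trend_slope(t, y, min_pts=3):
    """Return (model_name, slope_used_for_forecasting).

    Assumes t is sorted ascending with distinct values.
    """
    n = len(t)
    _, lin_slope, lin_rss = ols(t, y)
    best = ("linear", lin_slope, bic(lin_rss, n, 2))
    # Try each split point; points before index i form the first segment.
    for i in range(min_pts, n - min_pts + 1):
        _, _, rss_left = ols(t[:i], y[:i])
        _, right_slope, rss_right = ols(t[i:], y[i:])
        score = bic(rss_left + rss_right, n, 5)
        if score < best[2]:
            # Piecewise preferred: keep the slope after the breakpoint.
            best = ("piecewise", right_slope, score)
    return best[0], best[1]
```

The Last M Months method would correspond to calling `ols` on only the observations from the final M months and using that slope directly.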
Confidence Intervals
CIs are computed using prediction interval formulas that account for:
- Slope uncertainty from the linear regression
- Residual variance around the trend
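As an illustration, the textbook prediction interval for simple linear regression combines both sources of uncertainty. The sketch below is an assumption about the general form, not a copy of the formula used here: it uses a normal critical value from the standard library instead of a Student-t quantile (which makes intervals slightly too narrow for small samples), and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def prediction_interval(t, y, t_new, level=0.90):
    """Return (lower, center, upper) for a new observation at time t_new."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    slope = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y)) / sxx
    intercept = y_mean - slope * t_mean
    rss = sum((yi - (intercept + slope * ti)) ** 2 for ti, yi in zip(t, y))
    s2 = rss / (n - 2)  # residual variance estimate
    # 1       -> scatter of a new observation around the trend
    # 1/n     -> uncertainty in the fitted mean level
    # last    -> slope uncertainty, growing with distance from the data
    se = math.sqrt(s2 * (1 + 1 / n + (t_new - t_mean) ** 2 / sxx))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation
    center = intercept + slope * t_new
    return center - z * se, center, center + z * se
```

The `(t_new - t_mean) ** 2 / sxx` term is why intervals widen as the forecast horizon grows.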
Calibration Note: The current CIs are too narrow (overconfident): only about 53% of actuals fall within our 90% CIs, versus the expected 90%. The intervals do not account for uncertainty in the benchmark parameters (EDI, slope) or in model selection. Future work could use bootstrap sampling to produce better-calibrated intervals.
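Empirical coverage of this kind can be measured by counting how often the actual value lands inside its interval. The sketch below uses a hypothetical helper name and made-up toy numbers, not the backtest results.

```python
def empirical_coverage(actuals, lowers, uppers):
    """Fraction of actual values that fall inside [lower, upper]."""
    hits = sum(lo <= a <= hi for a, lo, hi in zip(actuals, lowers, uppers))
    return hits / len(actuals)

# Toy example: two of three actuals fall inside their intervals.
coverage = empirical_coverage(
    actuals=[0.5, 0.7, 0.9],
    lowers=[0.4, 0.4, 0.4],
    uppers=[0.6, 0.8, 0.8],
)
```

For well-calibrated 90% intervals this fraction should be close to 0.90 across backtest dates and benchmarks.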